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Abstract 

In many applications, such as economics, operations research and reinforcement learning, one 
often needs to estimate a multivariate regression function / subject to a convexity constraint. 
For example, in sequential decision processes the value of a state under optimal subsequent 
decisions may be known to be convex or concave. We propose a new Bayesian nonparametric 
^yy multivariate approach based on characterizing the unknown regression function as the max 

of a random collection of unknown hyperplanes. This specification induces a prior with large 
support in a Kullback-Leibler sense on the space of convex functions, while also leading to strong 
posterior consistency. Although we assume that / is defined over R p , we show that this model 
has a convergence rate of log(n) 1 n 1 /( d + 2 ) under the empirical L2 norm when / actually maps 
a d dimensional linear subspace to R. We design an efficient reversible jump MCMC algorithm 
for posterior computation and demonstrate the methods through application to value function 
approximation. 

1 Introduction 

> 

(S| Consider the problem of estimating the function / for the model 

y = /(x) + e, 

where x e X C W, y e R, / : W -»■ R is a mean regression function and e ~ 7V(0,cr 2 ). Given the 
observations (xi, yi), . . . , (x n , y n ), we would like to estimate / subject to the convexity constraint, 

/(x 1 )>/(x 2 ) + V/(x 1 ) T (x 1 -x 2 ), (1) 

for every xi,X2 E X, where V/(x) is the gradient of / at x. This is called the convex regression 
problem. Convex regression can easily be modified to allow concave regression by multiplying all 
of the values by negative one. 

Convex regression problems are common in economics, operations research and reinforcement 



learning. In economics, production functions (Skiba 1978) and consumer preferences (Meyer & 



Pratt 1968) are often convex, while in operations research and reinforcement learning, value func- 



tions for stochastic optimization problems can be convex (Shapiro et al. 2009). If a problem is 
known to be convex, a convex regression estimate provides advantages over an unrestricted esti- 
mate. First, convexity is a powerful regularizer: it places strong conditions on the derivatives — and 
hence smoothness — of a function. Convexity constraints can substantially reduce overfitting and 
lead to more accurate predictions. Second, maintaining convexity allows the use of convex opti- 
mization solvers when the regression estimate is used in an objective function of an optimization 
problem. 
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Multivariate convex regression has received relatively little attention in the literature. The oldest 
method is the least squares estimator (LSE) ( |Hildreth|l954 , Dykstra|[l983 , Boyd &; Vandenberghe 
20041 |Seijo fcSen||20nt , 



mm 

yi:n,gl:n 



(2) 



2=1 



subject to yj >w + gf (xj - x*), i, j = 1, . . . , n. 



The resulting function is piecewise linear, generated by taking the maximum over the supporting 
hyperplanes, gi :n . However, Equation ^ has n 2 constraints, making solution infeasible for more 
than a few thousand observations. Recently, there has been interest in multivariate convex regres- 



sion beyond the LSE. |Henderson fc Parmeter (2009) proposed a method that generates a regression 



estimator via a weighted kernel estimate subject to conditions on the Hessian of the estimator; so- 
lutions are found using sequential quadratic programming. Convexity is guaranteed only at points 
where the Hessian condition is enforced and the method does not scale well to high dimensions 
or large datasets. Hannah & Dunson (2011) proposed a method, Convex Adaptive Partitioning 



(CAP), that adaptively splits the dataset and fits linear estimates within each of the subsets. Like 
the least squares estimator, the CAP estimator is formed by taking the maximum over hyperplanes; 
unlike previous methods, it produces a sparse estimator that scales well to large datasets and large 
numbers of covariates. However, it has theoretical guarantees only in the univariate case. 

Piecewise planar models, like the LSE and CAP, are poor when used in the objective function 
of an optimization problem. The minima of piecewise planar functions occur at a vertex where 
p+1 hyperplanes intersect. The location of vertices is sensitive to the number of hyperplanes and 
the hyperplane parameters. The parameters are in turn sensitive to noise and observation design. 
Bayesian models could reduce these problems: prior distributions on parameters reduce design 
sensitivity and model averaging produces a smoother estimate. 

Bayesian models have been used for convex regression, but only in the univariate case. In this 
setting, methods rely on the ordering implicit to the real line: a positive semi-definite Hessian trans- 



lates into an increasing derivative function in one dimension. Ramgopal et al. (1993) discretized the 



covariate space and placed a Dirichlet prior over the normalized integral of the slope parameters 



between those points. Chang et al. (2007) used Bernstein polynomials as a basis by placing a prior 



on the number of polynomials and then sampling from a restricted set of coefficients. Shively et al. 



(2011) used fixed knot and free-knot splines with a prior that placed an order restriction on the 



coefficients for each basis function. In a single dimension, Bayesian convex regression is closely re- 
lated to Bayesian isotonic regression ( Lavine fc Mockus]|1995 , Neelon fc Dunson||2004 , Shively et al 
2009). In multiple dimensions, however, convexity constraints become combinatorially difficult to 



enforce through projections. 

We take an entirely different approach to modeling convex functions. Instead of creating an 
estimator based on a set of restricted parameters or projecting an unconstrained estimate back into 
the space of convex functions, we place a prior over a smaller set of functions that are guaranteed 
to be convex: piecewise planar functions. The number of hyperplanes and their parameters are 
random; we define the function to be the maximum over the set of hyperplanes. We efficiently 
sample from the posterior distribution with reversible jump Markov chain Monte Carlo (RJMCMC). 
We call this approach Multivariate Bayesian Convex Regression (MBCR). Although the set of 
piecewise planar functions does not include all convex functions, it is dense over that space and we 
show strong (L\) consistency for MBCR. If /(x) = g(Ax) for some d x p matrix A and function g, 
we show convergence rates for MBCR with respect to the L2 norm to be log(n) _1 n _1 /^ +2 ^. The 
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dimension of the linear subspace, d, determines the convergence rate, not the dimension of the full 
space, p. 

In numerical experiments, we show that MBCR produces estimates that are competitive with 
LSE and CAP in terms of traditional metrics, like mean squared error, and can outperform them 
in objective function approximation. Through examples on toy problems, we show that MBCR 
has the potential to produce regression estimates that are much better suited to objective function 
approximation than piecewise planar methods. 



2 Multivariate Bayesian Convex Regression 

Convexity is defined by the set of supporting hyperplane constraints in Equation ([!]): any supporting 
hyperplane of the function / at xi is less than or equal to /(X2) at any other point X2. This is 
equivalent to / having a positive semi-definite Hessian. In multiple dimensions, it is difficult to 
project onto the set of functions that satisfy these constraints. Instead of placing a prior over an 
unconstrained set of functions and then restricting the parameters to meet convexity conditions, 
we place a prior over a smaller set of functions that automatically meet the conditions. Specifically, 
for all x in a compact set X we place a prior over all functions that are the maximum over a set of 
K hyperplanes, (a i? . . . , (a K , Pk) € 

/(x) = max a k + /3^x, (3) 
ke{i,...,K} 

where K is unknown. This set of functions can approximate any convex function / arbitrarily well 
while maintaining straightforward inference. 
Assuming /(x) follows Equation Q, we let 

Y i = f(^ i ;9) + e i , e; ~ N(0, a 2 ), (4) 

where the unknown parameters are 

9 = {K, a = (a u . . . , a K ) T , (3 = (ft,..., f3 K ) T , a 2 }. 

The prior II over {K, a, /3, a 2 } is factored as, 

K 

IL(K, a, /3, a 2 ) = U a (a 2 )U K (K) f] IL e (a k , f3 k ). 

k=i 

The prior for the variance parameter, cr 2 , is defined as 11^, and the prior for the number of hyper- 
planes, K, is Hk- The hyperplane parameters, 9 k = (otkiPk)' are given the prior n#. These yield 
the model, 

K~Il K , cT 2 ^U a , 9 k \K~IL e , k = l,...,K. 



MBCR is similar to Bayesian adaptive regression spline (BARS) models (Denison et al. 1998 



DiMatteo et al. 2001, S hively et al.||2009 , 2011) in that the method places a prior over a finite set 



of locally parametric models, with the prior accommodating uncertainty in the number of models, 
their locations and their parameters. Indeed, we use the same inference method: reversible jump 
Markov chain Monte Carlo (RJMCMC). In both cases, RJMCMC works by adaptively adding 
and removing local models while updating the model-specific parameters. However, while BARS 
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explicitly introduces random changepoints or knots within a region, in MBCR regions are implic- 
itly defined as corresponding to locations across which a particular hyperplane dominates. Let 
{Ai, . . . , Ak) be a partition of X where 



Ak = {x E X : k — arg max a 7 - + Ba x}. 

je{i,...,K} 3 



As in the local knot search of |DiMatteo et ajj ( |2001| ), we use these regions to produce an efficient 
proposal distribution for the RJMCMC. We discuss implementation details for MBCR in Section 
[4j but first we show consistency and rate of convergence for MBCR in Section |3j 



3 Theoretical Results 



Posterior consistency occurs if the posterior assigns probability converging to one in arbitrarily small 
neighborhoods of the true function fo as the number of samples n grows. The rate of convergence 
is the rate at which the neighborhood size can contract with respect to n while still maintaining 
consistency. Despite the longstanding interest in shape-restricted estimators, relatively little work 
has explored their asymptotic properties — particularly in multivariate and Bayesian settings. In 
the frequentist fra mework, |Hanson fc Pledger ( |1976| ) showed consistency of the univariate LSE for 



convex regression; 



Groeneboom et al. (2001) showed it has a local convergence rate of n 2 / 5 . More 



recently, Seijo & Sen (2011) showed consistency for the multivariate LSE. 



There is also a recent literature on the related topic of multivariate convex-transformed density 
estimation. Cule et al. (2010) showed consistency for the MLE log-concave density estimator; 



Seregin & Wellner (2010) showed consistency for the MLE of convex-transformed density estimators 



and gave a lower minimax bound on the convergence rate of n 2 /(p+ 4 ). Bayesian shape- restricted 



asymptotics have received even less attention. Shively et al. (2009) showed consistency for monotone 



regression estimation with free knot splines in the univariate case; this was extended to univariate 



convex regression estimation by Shively et al. (2011) 



Let 9 E 6 be the set of parameters to be estimated. Let II be the prior induced on / by 



K — 1 rsj Poisson(X), 



(P 1 r^j 



0u I K ~ Ho 



k 



(5) 



a 



< e 



where 11^ is defined in Assumption B2 and Hq in Assumptions B3 and B4. We consider strong, 
or Li, consistency. That is, let 

L e =|(/,a): jT |/(x)-/ (x)|dx<e, 

where the data-generating model is 

Yi = /o(x;) + e;, e< ~ N(0, a ) 2 . 

We would like IL(L? \(Xi, Yi)? =1 ) -> as n —± oo, almost surely FJ^ aQ , where aQ is the product 
measure under the true distribution. Throughout the rest of this paper, we use lower case and yi 
to denote known or observed quantities, while and Y{ denote random variables. We show that 
MBCR is strongly consistent under a general set of conditions. 

Bayesian rates of convergence are slightly different from their frequentist counterparts. A series 
( 6 n)^Li where e n — >► is a rate of convergence under a metric d(9, 9q) if 

P^ II((9 G O : d(0,0 o ) > H n e n \(Xi,Yi)? =1 ) -> 

for every H n — > oo. We examine convergence rates with respect to the empirical L2 norm. Moreover, 
if fo actually maps a d-dimensional linear subspace of MP to R, then the convergence rate is 
determined by the dimensionality of the subspace, d, rather than the full dimensionality, p. 
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3.1 Consistency 

We consider two design cases for consistency: fixed design and random design. We place a series of 
assumptions on the true function, the prior and the design. Some of the assumptions on the prior 
are specific to the design type. In both cases, we assume that fo is uniformly bounded: 

Bl. The function fo is uniformly bounded on the compact set X. 

Without loss of generality, we assume that X = [0, 1} P . 

For both design types, we need define the prior 11^ and Hq in Equation First, we assume that 
the prior on a 2 has compact support bounded away from zero. This is not a restrictive assumption 
in practice since zero measurement error is unlikely to occur and an upper bound can be easily 
chosen to cover a wide range of plausible values. Second, in the case of fixed design, we assume 
compact support of the prior for the hyperplane parameters; again, a wide range of plausible values 
can be chosen. Truncated normal and inverse-gamma distributions provide a convenient choice. 

B2. Let Ilcr be the prior on a; 11^ is non-atomic and only has support over [a, a] with < a < 
ao < a < oc. 

B3. Let Hq = N p j r \{ii a ^ 1 V a fi) be the prior on where N p +\ is the p + 1 dimensional Gaussian 
distribution. 

B4. Let = A/p + i(^ 5/ 3, V a ^). Let L be a constant such that L > ||^fr/o( x )||oo and for some 
V > ±>L, let 

ft = {(<*,/?) : max{a,/3i,...,/3 p }<y}. 

Set = n*(- n ft)/nj(«) and let e k ~ u e . 

For both design cases, we need to ensure that the covariate space is sufficiently well-sampled. 

B5. For each hypercube H in X, let X(H) be the Lebesgue measure. Suppose that there exists 
a constant K p with < K p < 1 such that whenever \(H) > 7^^, H contains at least one 
design point for sufficiently large n. 

B6. Let Q be the density of the random design points; Q is non-atomic and Q(x) > for every 

x e x. 

With these assumptions, we now give consistency results. 

Theorem 3.1. Assume that X is compact, the covariate design is fixed and that fo is convex with 
continuous first order partial derivatives. Suppose that conditions Bl, B2 ; B4 and B5 hold. Then 
for every e > 0, 

Pf O;ao Il(LC\Y 1 ,...,Y n ,x 1 ,...,x n )^0. 

In the stochastic design case, assumptions B4 and B5 are replaced by B3 and B6, respectively. 
We note that for random design, L\ convergence follows directly from convergence in probability 
for a uniformly bounded function /q- 

Theorem 3.2. Assume that X is compact, the covariate design is random and that fo is convex 
with continuous first order partial derivatives. Suppose that conditions B1-B3 and B6 hold. Then 
for every e > 0, 

Ff Otffo n(L°\Y 1 ,...,Y n ,X 1 ,...,X n )^0. 
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To prove Theorems |3.1| and |3.2 



we use the consistency results for Bayesian nonparametric 
regression of|Choi fc Schervish| (|2007[) . We show that the prior II satisfies the following assumptions, 



Al. 



The prior II puts positive measure on the neighborhood 

B s =!(f,a) : ||/-/o||oo<5, 



A2. 



for every 6 > 0. 
Set <d n Gin x 



and M n = 0(n° 



where 



e 



In 



with 2 < & < 1- 



<M n , 



_d_ 



f 



<M n ,j = l,...,p 



Then there exists Ci, ci > such that 11(0^) < Ci_e 



Choi &; Schervish ( |2007 ) modifies the consistency theorem of Schwartz] (1965) for non-i.i.d 



observations; the requirements are prior positivity on a set of variationally close Kullback-Leibler 
neighborhoods and the existence of exponentially consistent tests separating the desired posterior 
region from the rest. Assumption Al satisfies the prior positivity while assumption A2 constructs 
a sieve that is used to create the exponentially consistent tests. Assumptions Al and A2 generate 
pointwise convergence in the empirical and in-Q-probability metrics for the fixed and random design 
cases, respectively. Assumptions Bl to B6 are then used to extend consistency under these metrics 
to consistency under the L\ metric. See Appendix A for details. 



3.2 Rate of Convergence 

We determine the rate of convergence of MBCR with respect to the empirical L2 norm, 

/ n 

ii/ii»= -£ /(x<)2 

V i=i 

For both the fixed design and random design cases, we make the following assumptions: 
B7. The model variance ctq is known. 

B8. There exists a convex function go : R. d —> R and a matrix A E W xd with rank d where d < p 
such that /o( x ) — So (Ax). 

We make assumption B7 for convenience; it can be loosened with sufficient algebraic footwork. 
Assumption B8 says that fo actually lives on a d— dimensional subspace; this is not restrictive as 
it is possible for A = I p . However, many situations arise when d « p. For example, there may 
be extraneous covariates or the mean function may be a function of a linear combination of the 
covariates — in effect, d = 1. It is not required that A is known a priori, but simply that it exists. 
We also keep assumption B4, which truncates the tails of the Gaussian priors for the hyperplane 
slopes and intercepts; this is done to bound the prior probability of the compliment of the sieve. 

Theorem 3.3. Assume that X is compact and that fo is convex, has continuous first order partial 
derivatives and suppose that conditions Bl, B4 ; B7 and B8 hold. For both random covariates and 
fixed covariates and sufficiently large V , 

p%n(/ : ||/-/ ||„>fr„c„|yi,...,y„,xi,...,x n )->o 

for any H n — y oo, where e~ l = log(n) n 1 ^ d+2 \ 
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Theorem |3.3| is proven by showing that the conditions for Theorem 3 of |Ghosal fc van der Vaart 



(|2007j) are satisfied. Details are given in Appendix B. 

We note that the rates achieved in Theorem |3.3| are within a log term of global minimax 



rates for general nonparametric convergence, e ri 



n 



-V(p+2) 



assuming A 



LP' 



However, the 
oo metric scales 



e-metric entropy of the set of bounded con vex functions with respect to the | 
like e _p / 2 (van der Vaart & Wellner 1996), leaving open the possibility of convergence rates of 
e n = n -2 /( p+4 ^ for bounded convex functions. In certain settings of convex-transformed density 
estimation that rate has been obtained ( Seregin fc Wellner||2010 ). We, however, do not believe that 
MBCR achieves this rate in a general setting. 



4 Implementation 

In this section, we extend MBCR to a model that can accommodate heteroscedastic data and 
provide a reversible jump MCMC sampler. 



4.1 Heteroscedastic Model 

The model in Section [2] assumes a global variance parameter, a 2 . While this is often a reasonable 
assumption, it can lead to particularly poor results when it is violated in a shape-restricted setting: 
locally chasing outliers in high-variance regions can lead to globally poor prediction due to the 
highly constrained nature of convex regression. To accommodate heteroscedasticity, we consider 
the following model, 

Y i = f(xL i ;0) + e i , * ~ N(0, gfr)), 

where g : MP — > R+. Specifically, to induce a flexible prior on g, we introduce a separate variance 
term for each hyperplane and modify the model to let 

Yi = max a k + 0£*i + e h e i ~ N (°i a k)> 
ke{l,...,K} 

(Ok, °l) ~ Np+iIG(v>a,p, V a ,p, a, 6), k = 1, . . . , K, 
K — 1 Poisson(X). 

Here N p +\IG denotes the normal inverse gamma distribution with a p + 1 dimensional normal. 
We choose a Poisson prior for the number of components, although we note that the model is 
generally not sensitive to the prior on the number of components. Due to the adaptable nature of 
the heteroscedastic model and its resistance to variance misspeciflcation, we use it for all numerical 
work. 



4.2 Posterior Inference 



To sample from the posterior distribution, we use RJMCMC with the marginal posterior distribu- 
tion of {K,a, /3,cr 2 } as the stationary distribution. Similar methods have been used for posterior 
inference on free-knot spline models by Denison et al.| ( [1998 ) and DiMatteo et al. (2001). 

RJMCMC works by proposing a candidate model, {K*, a*, cr 2 *}, and determining whether 
or not to move to that new model based on a Metropolis-Hastings type acceptance probability, 

or* * a* 2*|t^ a 2x • L p(^|x,K*,a*,/3*,a 2 *) 
a(K ,a , p , <7 I K , a, p, a ) = mm < 1, — — — — ^ — (6) 



n(K*,a*,/3*,a 2 



' p(y|x,K,a,/3,a 2 ) 
g(K,a,/3,a 2 |K*,a*,/3*,a 2 *) 



I1(K, a, /3, a 2 ) q(K*, a*, a 2 * | K, a, /3, a 2 ) 
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Here p(Y | x, K*, a* , /?* ,a 2 *)/p(Y | x, K, a, p, a 2 ) is the likelihood ratio of the data conditioned on 
the models, U(K*,a*,/3*,a 2 *)/U(K,a,/3, a 2 ) is the prior ratio of the models and 

q(K, a, /3, a 2 | iT, a\ /T , a 2 *) /q(K*, a\ /T , a 2 * | a, /3, a 2 ) 

is an asymmetry correction for the proposal distribution. Candidate models are entirely new models: 
all parameters are updated as a block. If only individual parameters or hyperplanes are updated, 
acceptance rates for parameters in the most constrained areas are orders of magnitude lower than 
those in the relatively unconstrained regions on the boundary of the function. Without block 
updates, there is poor mixing. There are three types of candidate models: hyperplane relocations, 
deletions and additions. All candidate models are generated from proposal distributions, which 
significantly impact the efficiency of the RJMCMC algorithm. 

To generate proposal distributions we use the covariate partition induced by the current model 
{K, a, /3, a 2 } to create a set of basis regions. Basis regions are determined by partitioning the set 
of training data. For example, suppose that a partition of the observations, (xi,yi), . . . , (x n ,?/ n ), 
has K subsets. Let C = {Ci, . . . , Ck}, where Ck = {i : i in subset k) . We can use C to produce 
a set of basis regions for generating (a*,/3*) with K components, 

n*=(^J+xf fc] x [fc] )~\ (7) 

a k = a +Y> 

(at, ft,4*) ~ N P+1 IG (4, V k *,a* k ,b* k ), k = 1, . . . , K. 

Here, x^j = {[l 5 x i] : i ^ C/c} 5 Y[k] — {Vi '• i ^ Ck} and is the number of elements in subset k. 
The hyperparameters for the proposal distributions, t^g, a, 6), are not necessarily the same 

as those for the prior. Often the variance parameters are smaller to produce higher acceptance 
rates. The current set of hyperplanes, {K, are used to create the partitions that define the 

basis regions, 

Ck — { i ' k = arg max a 7 - + Bj x ? 

For a relocation proposal distribution, the K basis regions are generated by the covariate par- 
tition of the current model. The removal proposal distribution is a mixture with K components. 
Each component is generated by removing the hyperplane k for k = 1, . . . ,K and using the re- 
maining K — 1 hyperplanes to create a set of basis regions. Proposal distributions for additions 
are less straightforward. The addition proposal distribution is a mixture with KLM components. 
Beginning with the subsets defined by the current model, {K, a, /3, a 2 }, each subset j = 1, . . . , K 
is searched along a random direction m for m — 1, . . . , M. On each of those random directions, 
the subset j is divided according to a knot into the set of observations less than a\ in direction 
m and those greater. This is done for £ = 1, . . . , L knots for each subset j and direction m. An 
example is shown in Figure [T] Full implementation details are given in Appendix C. 

We note that the sampler for MBCR does not behave like a typical MCMC sampler. Conver- 
gence and mixing are extremely fast. Unlike most MCMC samplers, the MBCR sampler converges 
once the "right" number of components has been reached, typically within zero to four of the mean 
number of components. This is due to the way the proposal distributions are constructed and 



8 



c. 



c 2 



°o°cP 



(A) Original Partition 



(B) j=l 



(C) j=2 



q u q 1 ; 1 C^' 1 
• 



co 



o ° 



o oo 



oqO 



^2,1 ^.2,1 



>©© 

o oo 



C): 



C 1Z C 

1 + 2 



©© P t 



*Wg© 

o oo 



^2,2 ^2,2 

S- S + 



Figure 1: Basis regions for one covariate combination when L — 2 and K — 2. (A) shows the 
original partition; (B) shows the partition when the region induced by the first hyperplane is split; 
(C) shows the partition when the region induced by the second hyperplane is split. 

the strict requirements of convexity. Block updating ensures that autocorrelation drops to near 
zero rapidly. Numerical results suggest this generally happens after about three samples. While 
convexity endows the sampler with properties like fast convergence, it can also lead to situations 
where the restrictions are too rigid for the sampler to function. For example, if the noise level is 
very low, the number of observations is more than a few thousand, or the number of dimensions 
is moderate to high, the region of admissible models becomes very small and the acceptance rates 
rapidly drop to zero. Approximate inference methods seem to be required in these situations. 



5 Applications 



In Section |5.1[ we compare the performance of MBCR to other regression methods on a set of 
synthetic problems. We show that convexity constraints can produce better estimates than their 
unconstrained counterparts and that MBCR is competitive with state of the art convex regression 



methods with respect to mean squared error. In Section 5.2, we analyze the behavior of MBCR, 
CAP and LSE when approximating an objective function for convex optimization. We show that 
MBCR produces estimates that are more suited to objective function approximation than those 
produced by CAP or LSE. 



5.1 Synthetic Problems 

In this subsection, we create a set of synthetic problems designed to show off the strength of con- 
vexity constraints. Problem 1 is highly non-linear and has moderate dimensionality (5); Problems 
2 and 3 also have moderate dimensions in the covariate space (6 and 4, respectively), but both 
actually reside in a univariate subspace. 



Problem 1. Let x e R 5 . Set 

y = (xi + .5x2 + X3) 2 — X4 + .25^5 + e, 
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Table 1: Mean squared error on Problems 1, 2 and 3. 



Problem 1 


Method 


n = 100 


n = 200 


n = 500 


n = 1,000 


MBCR 


1.0373 


0.3679 


0.2784 


0.2180 


CAP 


1.6878 


1.5336 


0.3646 


0.1500 


LSE 


4.0174 


1.4370 


13.3398 


1.8434 


GP 


7.6612 


6.2974 


4.4793 


3.5518 


Problem 2 


Method 


n = 100 


n = 200 


n = 500 


n = 1,000 


MBCR 


0.0943 


0.0720 


0.0155 


0.0182 


CAP 


0.1191 


0.0887 


0.0205 


0.0129 


LSE 


4.6521 


2.8926 


1.9979 


5.4998 


GP 


0.3555 


0.3932 


0.3598 


0.2174 


Problem 3 


Method 


n = 100 


n = 200 


n = 500 


n= 1,000 


MBCR 


0.1399 


0.0775 


0.0138 


0.0102 


CAP 


0.1886 


0.1308 


0.0192 


0.0164 


LSE 


4.7537 


2.0210 


1.4801 


7.6638 


GP 


2.0351 


3.4649 


2.9349 


3.2026 



where e ~ ^V(0, 1). The covariates are drawn from a 5 dimensional standard Gaussian distribution, 
A 5 (0,I). 

Problem 2. Let x e I 6 . Set 

y=(xi + x 2 f + e, 

where e ~ A(0, .5 2 ). The covariates are drawn from a 6 dimensional uniform distribution, Xj ~ 
Unif [-1,1] for i = 1, . . . , 6. 



Problem 3. Let xel 4 . Set 



y 



I T I 

a x + e, 



[0.8262,0.9305,1.6361,0.6072] 



where e ~ A^(0, l 2 ). The covariates are drawn from a 4 dimensional uniform distribution, xj 
Unif [-4,4] for j = l,...,4. 



5.1.1 Results. 



On all of these problems, MBCR is compared to CAP, LSE and Gaussian Process priors (Rasmussen 



fc Williamsl|2006[ ) . Gaussian process priors are a Bayesian method that is widely and successfully 
used in regression and classification settings; we use the Matlab gpml package for implementation. 
All methods were implemented in Matlab; the least squares estimate (LSE) was found using the cvx 
optimization package. LSE took 5 to 6 minutes to run with 500 observations and 50 to 60 minutes 
to run with 1,000. The tolerance parameter for CAP was chosen through five- fold cross-validation. 
MBCR was implemented with component-specific variances. It was run for 1,000 iterations with 
the first 500 discarded as a burn-in. 

Due to the highly constrained nature of the model and block updating, convergence of the sam- 
pler was extremely fast in all settings. The model was generally insensitive to the hyperparameter 
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for the number of hyperplanes, A; it was varied over three orders of magnitude and set to 20 for all 
tests. In lower, dimensions, however, choice of A was more important. Likewise, variance hyperpa- 
rameters were tested over three orders of magnitude with little sensitivity. Distributions were not 
placed over the variance hyperparameters because of the delicate relationship between the proposal 
distributions and the hyperparameters. All mean hyperparameters were set to 0. 

MBCR and CAP dramatically outperformed other methods on all of the problems. LSE had 
relatively poor performance although it includes convexity constraints. This is due to overfitting, 
particularly in boundary regions. MBCR and CAP performed comparably on Problems 2 and 
3, which both reside in a univariate subspace of the general covariate space. However, MBCR 
outperformed CAP on the more complex Problem 1, particularly when there were few observations 
available. 



5.2 Objective Function Approximation for Stochastic Optimization 

Stochastic optimization methods are used to solve optimization problems with uncertain outcomes. 
The traditional objective is to minimize expected loss. There are many problems in this class 
ranging from stochastic search (Spall 2003) to sequential decision problems QSutton fc Barto||l998 



Powell|[2007 ). In this section, we study the use of convex regression to compute response surfaces 
A response surface is an approximation of an objective function based on a collection of noisy 
samples. Once a response surface has been created, it is searched to estimate the minimizer or 
maximizer of a function. Convex representations are desirable. First, the resulting approximation 
will likely be closer to the true objective function than an unconstrained approximation. Second, 
and more importantly, the surrogate objective function is now convex as well and can be easily 
searched with a commercial solver. 

Consider the following problem. We would like to minimize an unknown function /(x) with 
respect to x given n noisy observations, (x^,^)^ =1 , where yi — /(x^) + e 



minE{/(x)|(x;, yi )r =1 }. (8) 

To solve Equation Q, we approximate E{/(x) | (x^,^)^ =1 } with three different methods for 
regression: least squares, CAP and MBCR. Let / n (x) be the estimate of the mean function given 
(xiiVijiLi- Unlike CAP and LSE, MBCR is a Bayesian method; it places a distribution over 
functions rather than producing a single function estimate. Let /^^(x) be a sample from the 
posterior; the Bayes estimate of the mean function can be approximated by the average of M 
samples from the posterior, 

1 M 

m=l 

We demonstrate the empirical differences between the objective functions produced by MBCR, 
CAP and LSE by solving a small stochastic optimization problem. 



Example. Set 



Yi = XjQxf + ej, Q 



1 

0.2 



0.2 
1 



JV(0,0.1) 



(9) 



The constraint set is —1 < Xj < 1 for j = 1,2. Observations were sampled randomly from a 
uniform distribution, x^ ~ Uniform[—l, l] 2 . We used LSE, CAP and MBCR to approximate the 
objective function. To examine the stability of these methods for objective function approximation, 
we sampled 100 observations 50 times for Equation Q. Approximations of the objective functions 
for one sample are shown in Figure [2] 
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True Function 



LSE 




Figure 2: Objective functions for LSE, CAP and MBCR for Equation @ given 100 observations. 
Both LSE and CAP produce piecewise-linear functions; CAP produces a sparser function than 
LSE. MBCR averages over piecewise-linear functions to produce an estimate that is much closer to 
smooth. 




Figure 3: Minima from the objective functions created by LSE, CAP and MBCR for Equation Q 
given 100 observations; contours are from the true function. The observations were sampled 50 
times; selections made when the objective function was approximated by MBCR are much more 
concentrated around the true minimum than those chosen using LSE or CAP. 
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5.2.1 Results 



We compared MBCR, CAP and LSE across 50 samples of 100 observations. The minima of piece- 
wise planar models, like CAP and LSE, are on one of the vertices (or occasionally along one of 
the edges); this makes the minima of such models highly sensitive to model parameters such as 
number of hyperplanes and the value of their coefficients. MBCR, however, places a distribution 
over piecewise planar models. The Bayes estimate averages those models to produce something 
that is close to smooth and hence is relatively robust to observation design. Figure [2] highlights 
these differences. The minima of both piecewise planar methods were sensitive to the observation 
design while the minima of MBCR proved more robust. Locations of minima are shown in Figure 

El 



5.2.2 Discussion 

Many methods for solving stochastic optimization problems, including response surface meth- 



ods (Barton fc Meckesheimer||2006, Lim 2010), Q-learning (|Ernst et al. 2005) and approximate 



dynamic programming ( Powell||2007 ), involve functional approximations that are then searched to 
find a solution that minimizes or maximizes the approximate reward. Current solution methods for 



these problems use either unconstrained regression methods (Lagoudakis & Parr 



2005) or additive approximations with univariate convex functions (Powell et al. 



2003, Ernst et al. 



2004, Nascimento 



fc Powell||2009 ) . Robust multivariate convex regression methods could allow efficient solution of a 
broad set of stochastic optimization problems, inlcuding resource allocation, portfolio optimization 
and inventory management. 



6 Conclusions and Future Work 

In this article, we introduced a novel fully Bayesian, nonparametric model for multivariate convex 
regression and showed strong posterior consistency along with convergence rates. We presented an 
efficient RJMCMC sampler for posterior inference. Our model was used to approximate objective 
functions for stochastic optimization and showed improvement over existing frequentist methods. 

While this work represents a large advancement for convex regression, much remains to be done. 
First, we need to develop sampling methods that scale to large problems. Second, MBCR needs to 
be tested on a variety of stochastic optimization problems. Third, MBCR can be combined with 
other Bayesian methods to produce a class of semi-convex estimators. 

Currently, the RJMCMC sampling method only scales well to moderate dimensionality and 
problem size: its limits are about 8 to 10 dimensions and a few thousand observations. Approximate 
inference methods, such as variational Bayes, could allow MBCR to solve problems an order of 
magnitude larger. Implementation, however, is not a straightforward extension of existing methods. 

In stochastic optimization, MBCR is an extremely promising tool for value function approxima- 
tion. Many solution methods for sequential decision problems include value function approximation, 
such as point-based value iteration (Pineau et al. 2003), fitted Q-iteration ( Ernst et al.||2005 ), ap- 



proximate dynamic programming ( Powell||2007| ) . All of these methods involve iterative searches of 



an approximate value function over sets of feasible actions where. In many problems, such as re- 
source allocation, the value function is known to be convex. Robust multivariate convex regression 
methods would allow a wider variety of problems to be solved, including those with large action 
spaces and non-separable objective functions. 

Perhaps the most intriguing feature of MBCR is that it is a Bayesian model and can easily be 
combined with other Bayesian models to produce estimators that are convex in some dimensions, 
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but not all. For example, it is well known that consumer preferences for bundled products tend to be 
convex. However, it is likely that other covariates like consumer age, gender, income and education 
influence the preference function — and the function is not convex in these covariates. This set of 
functions could be well-modeled by a combination of MBCR and Bayesian mixture models like 



Dirichlet processes (Ferguson 1973, Antoniak 1974) or hierarchical Dirichlet processes (|Teh et al. 



2006). Such flexible models would be of great value to an assortment of fields, including economics, 



operations research and reinforcement learning. 
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Appendix A 



Appendix A contains the proofs for Section 3.1 



To show pointwise convergence, we use Theorems 1 to 3 of Choi & Schervish (2007); they 
are condensed for this paper. For the fixed design case, let Q n be the empirical density of the 
design points, Q n (?c) — Y17=i l{xi}( x )- The empirical density is used to define the following 
neighborhood, 



W e , n = |(/,(7) : f |/(x) - / (x)| dQ n ( X ) < e, 



a 



< e 



Choi & Schervish 



2001) Let Pjq^ denote the joint conditional distribution of 



Theorem 6.1. 

given the covariates, assuming that fo is the true mean function and is the true variance. 
If assumptions Al, A2, Bl and B2 are satisfied, then for every e > 0, 

^ 1 n{(/, ff )e^|F 1 ,...,y„,x 1 ,...,x n }40. 

For the random design case, let Q be the density of the random design points. Let 



U e = (f,a) : inf{6 > : Q({x : |/(x) - / (x)| > e}) < e}, 



a 



< e 



be the set of neighborhoods based on the in-probability metric. 



Theorem 6.2. (Choi & Schervish 2007) Let aQ denote the joint conditional distribution of 
{Yi} ( ^ 1 given the covariates, assuming that fo is the true mean function and is the true variance. 
If assumptions Al, A2, Bl and B2 are satisfied, then for every e > 0, 

Ff Qt(70 U {(/, a) € U° | Y U . . . , Y n , X 1; . . . , X n } 0. 



We now show that the prior satisfies assumptions Al and A2 for Theorems 6.1 and 6.2 



Lemma 6.3. For every S > 0, the prior II from B2 and B3 or B4 puts positive measure on the 
neighborhood 

5 < 5=l(/,<7 2 ):||/-/o||oc<5, 





<'} 




^0 
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Proof. Fix 5 > 0. Break B s into two parts, U(B s (p)) = {/ : ||/ - / ||oo < S}, and U(B s (a 2 )) 



4- - 1 



< (5 1 . Under a truncated inverse gamma prior, 

n(^(a 2 )) = n((a 2 (l - <S) 2 , a 2 (l + <S) 2 )) > 0. 



To show prior positivity on II (B§ (/?)), we create a sufficiently fine mesh over X. On each section 
of the mesh, we show that there exists a collection of hyperplanes that 1) do not intersect with /o, 
and 2) have an distance from fo of less than 5 in that section. Since fo is bounded on X, it is 
Lipschitz continuous with parameter L. A mesh size parameter 7 > 0, which depends on 5, can be 
found to make a 7 mesh over X with the following requirements. 

Number regions r = 1, . . . , R; call the subsets of the covariate space defined by the regions M? . 
Because fo is Lipschitz continuous, an 77 > can be found such that for every region r, one can 
find a r * and /3 r * where for every a r E [a r * — 77, a r * + rj\ and f3 r E [/3 r * — 77I, /3 r * + 7/1], 



a r + /3^x < /o(x), 



/o(x) - a r - /3^x < S 



for every x E M^. 

We create a function /^(x) to approximate /o by taking the maximum over the set of R hyper- 
planes; using the above, we can bound the distance between fo and f§. 



SUp ||/ (x) - fs(x)\\oo = SUp ||/ (x) 



max a r + BTx.\ 

re{l,...,R} 



< max sup ||/ (x) - a r - /3 r x||oo, 
r=1 >-~> R xeM? 

<6. 

To complete the proof, we note that U(K = R) > and II ([a r * —77, av* +77], [(3 r * — 77I, /3 r * +77I]) > 
forr=l,...,i?. □ □ 

Lemma 6.4. Define the prior II as in B2 and B3. There exist constants C\ > and ci > such 
that n(e£) < Cie- Cin . 

Proof. Without loss of generality, assume = [0, l] p . Note that 



e? n = e\ 

ex) k p 



< M n 



A/ 



< M n , j = 1, 



,P 



e UUU {/(■!») : * = *.iflwi2^} 
U{/h«^-^i^}. 

Taking the probability of the right hand side of Equation ( [To| ) , 

n(ef n ) < ± u k{ k = ^)E(n(N>^) + tn( 

<2E u [K](p + l) ^ -±=e~\ 



(10) 



1/3. 



•j,e\ 



> 



< Cie 



■cin 



□ 



□ 
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Note that under the bounded prior assumption B4, II(0£) = for sufficiently large n. Theorems 
3.1 and 3.2 follow directly from Theorems 4 and 6, respectively, of |Choi fc Schervish ( |2007| and 
Theorems 6J_ and |6.2| In the random design case, L\ convergence is equivalent to in-probability 
convergence under assumptions Bl and B6; the fixed design case requires more care. See |Choi~fc] 



Schervish (2007) for details. 



Appendix B 



Appendix B contains the proofs for Section |3.2[ Theorem |3.3| relies on verifying the conditions of 
Theorem 3 of Ghosal & van der Vaart ( |2007| ), 



Theorem 6.5 (Ghosal & van der Vaart (2007)). Let Pj be a product measure and d n a semimetric 
and let be the space of all {A, a, fj\ -tuples with positive measure under II. Suppose that for a 
sequence e n such that ne^ is bounded away from zero, all sufficiently large j and sets Q n C 0, 
the following conditions hold: 

(i) sup e>en log7V(e/18,{/ E Q n : d n (fj ) < e},d n ) < ne 2 n1 

(ii) There exist tests $ n such that ^%<S> n < e~\ nd ^ (/oJl) and - $ n ) < e~\ nd ^ (/oJl) for all 
feQ such that d n (f, fi) < jod n (f , /i); 



n(eg) 



o e 



-2nd 



(ill) n(B*(/ ,Cn)) 
/An,) n(f£®n:jen<dn(f,fo)<2j 

€n ) ^nei 7 2 /4 

( W J n(s*(/o,c n )) - 



where B*(f ,e n ) 
d n (fi fo) > H n e n 



= {/ e e : i Er=i ^(/o, /) < 4, £ Er=i Wo, /) < ^m, p^n(/ : 



OO. 



The distance metric, d n that we will use is the 



\ n norm. Note that the 1 1 • | \ n norm is bounded 
by the || • | |oo norm; we shall do metric entropy computations with respect to the || • | |oo norm. The 
values Ki(foJ) and Vi(f 0l f) denote / / log(/ //)d/i and / /o(log(/o//)) 2 d/i, respectively. The 
quantity in condition (i) is the log of the covering number of the sieve under the supremum norm. 
To show that conditions (i) to (iv) of Theorem |6.5| are met, we check them off one at a time while 
working in the linearly transformed space, X — {y : y = Ax, x E X}. 

Lemma 6.6. Define <d n = 0i n and suppose B4 holds. Then, 

<e},\\-\ 



I ) < Cf~ d / 2 

\oo) 

Proof. Working in the transformed space, assumption B4 places bounds on the supremum and 



suplogJV(e/18,{/€e n :||/-/ | 

e>e n 



partial derivatives for all / E 0; the result then follows directly from Theorem 2.7.10 of |van der 



Vaart fc Wellner| (1996) or Theorem 6 of ( |Bronshtem"| [l976 ) for V = 1. By setting e = e/V, setting 



/ = f/V, fo = fo/V an d calculating the metric entropy with respect to e, / and fo, the result 
holds. This covering needs to be repeated at most e~ p times to cover the original space; taking the 
log, plog(l/e) can be bounded by a constant times e~ d / 2 . □ □ 

Lemma 6.7. Define U by B4 and B7. Let 

{ 1 n 1 n 1 

mO= /ee : -£Wo,/)<4 -X)Wo,/)<c4 • 

l i=l i=l J 
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(/o(xi) - /(x,)) 2 



Then there exist C\ and c\ > such that 

U(B* n (f ,e n ))>C ie c ^ dl °^. 

Proof. By simple calculations, 

2cr u 

To place a lower bound on the prior measure of S*(/o,e n ), we construct a subset and place prior 
bounds on that. 

Let f3 E a truncated Gaussian prior on /? induces a truncated Gaussian prior on f3 = A/3, 
the slope parameters in the transformed space. WLOG, take X — [0, l] d . Set S = 8v ^ cr 2 e n', let 

yi ? • • • Jm be a 5— net over A\ The net can be chosen such that rn < K m /ef l for some constant 
K m that depends only on d and ctq. Let 

-(»<»),£»<»).... 

Then with a sufficiently large truncation parameter V, for every fe E {1, . . . , m}, 

c d+l 

t n 5 



for some K a > that depends on ci, ctq, A and go. Set g(y) = max^{ 1; m y + /^jfy. Then, 
^ (/o(xi) ~^(x,)) 2 < 4, so 

II(B;(/o,e n )) >U K (K = m) 



> Cie cie "' lloge ", 
for some constants Ci, ci > 0. 



□ 



□ 



We can use Lemma 6.7 to check conditions (Hi) and (iv) of Theorem 6.5 
Lemma 6.8. Define II by B4 and B7. Then for every large j, 

n(e£) 



n(B*(/ ,e n )) 
n(/ € 6 n : je n < ||/, /o| loo < 2je n ) 



^ 2 g-C2n-ci£„ loge„ 



< de- Cl£n 



U(B*(f ,e n )) 

Proof. The first equation can be bounded by using Lemma |6~H the second by setting the numerator 
equal to 1. □ □ 



Now we use this collection of Lemmas and Theorem |6.5| to prove Theorem |3.3 

lof of Thei 
from Lemma 



Proof of The orem\3.3\ We begin by checking the conditions of Theorem |6.5| Condition (i) fol lows 
Setting e" 1 = log(n) n l ^ dJr2 \ conditions (Hi) and (iv) follow from Lemma 



6.6 



6.8 



Finally, Birge ( 2006| ) shows that the likelihood ratio test for fo versus f\ satisfies condition (ii) 
relative to the || • || n norm under both fixed and random design. Therefore, the main result follows 
directly from Theorem 6.5 □ □ 
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Algorithm 1 Reversible Jump MCMC for MBCR 



Initialize (if , a, /3, a 2 ): set if = 1, draw (ai,/3i,a 2 ) from posterior 
loop 

Draw a new (if*, a*, a 2 ) from the proposal distribution 

Set (if, a, /3, a 2 ) to (if*, a*, a 2 *) with probability a(if*, a*, a 2 * | if, a, /3, a 2 ) 
end loop 



Appendix C 



The RJMCMC algorithm is similar to the ones proposed by Denison et al. ( 1998) and DiMatteo et al 



(2001) for BARS. Jumps in the chain can take three forms: additions, deletions and relocations. 



The probabilities of additions, deletions and relocations must satisfy detailed balance equations, 
H(if + 1, a\ :K+1 , a 2 :X+1 *) (11) 

X p(K, ai :K , Pl:K, °\:K I K + h a l:K+li P*:K+1 ? a l:K+i*) 
= n(if, ai :K , 01:K, °1:k)p( K + h <*1:K+1> Pl:K+l> a l:K+l I K i a l:K> 01:K, °1;k)- 



DiMatteo et al.| ( [2001 ) shows that Equation (11) is satisfied if additions, deletions and relocations 



are attempted with the following probabilities, respectively, 

. f p(k + l) \ . f p(k-l) \ 

b k cmmj 1, , 4 = cmmjl, j , r k = l- b k -d k , 

where is the prior probability of k hyperplanes and c is a constant; we set c = 0.4. 

Additions. Given the current state (if, a, /?), a new state with if +1 hyperplanes, (if +1, a*, /?*), 
is proposed with the jump probability, 

q(K + 1, Ol\. k+1 , Pt-K+liVl-.K+l I K i a l:K, Pl:K , &1 : k) = b K h b (a* , (J 2 * | Q, /3, CT 2 ). 

Here bx is the addition probability given if hyperplanes and hi, is the proposal distribution for 
additions. 

Deletions. A new state with if — 1 hyperplanes, (if — l 5 a*,/3*) is proposed with the jump 
probability, 

g(if - 1, ol*. K -ii Pt.K-ii G \-K-\ I K , a i:K, Pv.k, 0\-k) = d K h d (a\ (3\ a 2 * | a, /3, a 2 ). 

Here dx is the deletion probability given if hyperplanes and is the proposal distribution for 
deletions. 

Relocations. A new state with if hyperplanes, (if, a*, /?*) is proposed with the jump probability, 

q(K, a* :K , P* :K , g\,k I if, Pi :K , <tI :K ) = r K h r (oL*,l3*, a 2 * | a, /3, a 2 ). 

Here tk is the relocation probability given if hyperplanes and h r is the proposal distribution for 
relocations. The full RJMCMC algorithm is given in Algorithm [T] 
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Proposal Distributions 

Posterior inference for our model is particularly sensitive to the choice of proposal distributions. 
The space of potential hyperplanes is quite large and grows with p. In order to efficiently search 
that space, we use a collection of basis regions to create h d and h r . Basis regions are determined 
by partitioning the set of training data, as described in Equation Q. We show how this is done 
for relocations, deletions and additions. 

Relocations. In a relocation step, the proposal distribution is generated by the basis regions 
created by the current (a, That is, C r = {C[, . . . , C^}, where 

CI — < i : k — arg max aj + 07x 2 
k I 3={i,.,K} 3 Hj 

Then the proposal distribution is created using this partition as in Equation Q. 

Deletions. In a deletion step, the proposal distribution is a mixture of distributions generated 
by basis regions, 

K 

h d (a\ /3\a 2 * \ a, /3, a 2 ) = ^2p d (j)h^(a* , cr 2 * | OL- h a 2 _ j ), 

3=1 

where Ylf=i Pd(j) — 1 5 an d the — j subscript denotes the indices {1, . . . , j — 1, j + 1, . . . , K}. For j = 
1, . . . , if , the distribution hj(a*, /?*, a 2 * | a 2 •) is made by creating the following partition, 

C d,j = { C d^ ^ C^lj, where 

Ct ' J — \ i '. k — arg max old + dj^i \ . 

k I Hi^j-ij+i,..,^} J 

Likewise, the distribution h^(a*, a 2 * | a 2 ■) is defined by Equation (M). The component 

probability Pd(j) is set to be proportional to 1/|CJ|, the inverse of the number of components 
supported by hyperplane j. If no components are currently supported, we set Pd(j) 1/.25. 

Additions. As in the deletion step, the proposal distribution is a mixture of distributions gen- 
erated by basis regions. Unlike deletions, it is not obvious how to add a hyperplane in a way 
that will result in a high quality proposal. In an addition step, MBCR starts with a set of hyper- 
planes, (a, and adaptively adds an additional hyperplane in the following manner. The current 
hyperplanes (a, (3) define a partition over the observation space, C = {Ci, . . . , Ck}, where 

C k = ji e {1, . . . , n} : k = arg ^max^ a k + /^x* 

MBCR splits each element j = 1, . . . , K of C in turn along a direction defined by a linear combi- 
nation of covariates, producing a collection of new covariate partitions. A number of knots, L, is 
chosen a priori, along with M random linear combinations, (g™, . . . , 5™)m=i- Then, for each direc- 
tion m — 1, . . . , M, and each knot £ — 1, . . . , L, a covariate partition is generated in the following 
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manner, 

c b,j,£,m = I i . k = ^ + ^T x .^ j ^ k [ 

I {fc=i,-,*0 J 

= < i : J = arg max m. + x*, g *j <a J A , 

= U : J = arg max m. + x*, g *j > a J A , 

where are chosen to produce L + 1 intervals between min{g mT x^ : i £ Cj} and max{g mT x^ : 
i G C,-}. Set 

syb,j,£,m _ f syb,j,£ n b,j,t,m n bj/,m ^bj&m ^bj&m ^fej^ral 

^ — ' ' - * ' — 1 ' ' ' ' K / • 

Often, it is convenient to chose the cardinal directions, that is, g M = ej, as the linear combinations 
for the covariates. However, when p is large and a sparse underlying structure is assumed, it is 
useful to choose g m to be a random Gaussian vector with M < p. 

The observation partitions are used to produce the following mixture model for the addition 
proposal distribution, 

V L 

h b {a\ P\ a 2 * I a, /3, a 2 ) = £ J> 6 (j, £, m)h% (a\ /T, a 2 *\C b ^™), 

where the distribution /i£(a* 3 a 2 * | C J ^' m ) is defined Equation Q. The weights pb(j,£,m) are 
set to be proportional to 

where n^b m — |C^'^' m | and n^+' m = |C^+'^' m |. This gives higher weight to partitions that split a 
large number of observations fairly evenly. 
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